Proposal for an MMX C-Interface

Introduction

The objective of this proposal is to provide a high level interface for programmers using lcc2 for accessing all new MMX instructions.

The MMX instruction set is accessible through intrinsic functions that are recognized and inlined by the compiler.

The data type used by all MMX intrinsics is an 8 byte union, described in ‘mmx.h’. The interface is designed to work at maximum speed when vectors of this datatype are used. The internal loop necessary to apply the given operation to all elements of the data vectors is generated in-line. The dimensions of both arrays should be identical.
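The contents of ‘mmx.h’ are not reproduced in this proposal, so the following is only a sketch of what such an 8 byte union could look like. The type name mmx_model_t and all member names are assumptions made for illustration; they are not the names used by the real header.

```c
#include <stdint.h>

/* Hypothetical stand-in for the 8-byte union declared in mmx.h.
   The member names are assumptions made for this sketch only. */
typedef union {
    int8_t   sb[8];  /* eight packed signed bytes       */
    uint8_t  ub[8];  /* eight packed unsigned bytes     */
    int16_t  sw[4];  /* four packed signed words        */
    uint16_t uw[4];  /* four packed unsigned words      */
    int32_t  sd[2];  /* two packed signed doublewords   */
    uint32_t ud[2];  /* two packed unsigned doublewords */
    uint64_t q;      /* one quadword                    */
} mmx_model_t;

/* Compile-time check: every view covers the same 64 bits. */
typedef char mmx_model_size_check[sizeof(mmx_model_t) == 8 ? 1 : -1];
```

A vector in the sense of this proposal would then simply be an array of such elements, passed together with its element count.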

Scalar extension is provided: one of the inputs to an MMX intrinsic can be a scalar, which the compiler automatically extends so that the MMX operation is applied to all elements of the input vector.

Since the MMX registers are aliased onto the floating point registers, MMX and floating point instructions cannot be freely mixed, so it is assumed that a function does not mix floating point and MMX code. An emms instruction will be issued in the function epilogue if the MMX instruction set is used.

The assembler interface is still available, and assembler instructions can be used directly. In this case, it is the programmer’s responsibility to issue the ‘emms’ instruction.

INSTRUCTION SYNTAX

Instructions vary by:

· Data type: packed bytes, packed words, packed doublewords or quadwords

· Signed - Unsigned numbers

· Wraparound - Saturate arithmetic

· Scalar/Vector data

A typical MMX instruction has this syntax:

· Prefix:

· ‘_’ to indicate that this is a compiler reserved word.

· ‘p’ for Packed, as Intel suggests.[1]

· Instruction operation: for example - ADD, CMP, or XOR

· Suffix:

· US for Unsigned Saturation

· S for Signed saturation

· B, W, D, Q for the data type: packed byte, packed word, packed doubleword, or quadword.

· ‘i’ for ‘immediate’ (scalar) data. If this suffix is not present, the function operates over two arrays.

Description of the interface

Pack with signed saturation

The pack operation operates with words (packed to bytes) or with dwords (packed to words).

void _stdcall _packsswb(_mmxdata *array1,_mmxdata *array2,int n);

Description

Each element of array1 will be packed with the corresponding element of array2. The result is written to array1. The number of elements of both arrays is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array1[n](7..0) = SaturateSignedWordToSignedByte array1[n](15..0);

array1[n](15..8) = SaturateSignedWordToSignedByte array1[n](31..16);

array1[n](23..16) = SaturateSignedWordToSignedByte array1[n](47..32);

array1[n](31..24) = SaturateSignedWordToSignedByte array1[n](63..48);

array1[n](39..32) = SaturateSignedWordToSignedByte array2[n](15..0);

array1[n](47..40) = SaturateSignedWordToSignedByte array2[n](31..16);

array1[n](55..48) = SaturateSignedWordToSignedByte array2[n](47..32);

array1[n](63..56) = SaturateSignedWordToSignedByte array2[n](63..48);

}
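The mode of operation above can be modelled in portable C, which may help when checking results on a machine without MMX. This is a reference sketch, not the code the compiler generates; mmx_model_t, sat_sw_to_sb and model_packsswb are hypothetical names introduced here.

```c
#include <stdint.h>

/* Hypothetical stand-in for _mmxdata: two views of the same 64 bits. */
typedef union {
    int8_t  sb[8];  /* packed signed bytes */
    int16_t sw[4];  /* packed signed words */
} mmx_model_t;

/* Clamp a signed word into the signed byte range. */
static int8_t sat_sw_to_sb(int16_t w)
{
    if (w > INT8_MAX) return INT8_MAX;
    if (w < INT8_MIN) return INT8_MIN;
    return (int8_t)w;
}

/* Reference model of the two-array pack with signed saturation. */
void model_packsswb(mmx_model_t *array1, const mmx_model_t *array2, int n)
{
    while (n-- > 0) {
        mmx_model_t r;  /* temporary: the result overlaps the input words */
        for (int i = 0; i < 4; i++) {
            r.sb[i]     = sat_sw_to_sb(array1[n].sw[i]);  /* low four bytes  */
            r.sb[i + 4] = sat_sw_to_sb(array2[n].sw[i]);  /* high four bytes */
        }
        array1[n] = r;
    }
}
```

The temporary is needed because the packed bytes are written over the same storage the source words are read from.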

void _stdcall _packsswbi(_mmxdata *array,_mmxdata *imm,int n);

Description

Each element of array will be packed with imm. The result is written to array. The number of elements of array is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array[n](7..0) = SaturateSignedWordToSignedByte array[n](15..0);

array[n](15..8) = SaturateSignedWordToSignedByte array[n](31..16);

array[n](23..16) = SaturateSignedWordToSignedByte array[n](47..32);

array[n](31..24) = SaturateSignedWordToSignedByte array[n](63..48);

array[n](39..32) = SaturateSignedWordToSignedByte imm(15..0);

array[n](47..40) = SaturateSignedWordToSignedByte imm(31..16);

array[n](55..48) = SaturateSignedWordToSignedByte imm(47..32);

array[n](63..56) = SaturateSignedWordToSignedByte imm(63..48);

}

void _stdcall _packssdw(_mmxdata *array1,_mmxdata *array2,int n);

Description

Each element of array1 will be packed with the corresponding element of array2. The result is written to array1. The number of elements of both arrays is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array1[n](15..0) = SaturateSignedDwordToSignedWord array1[n](31..0);

array1[n](31..16) = SaturateSignedDwordToSignedWord array1[n](63..32);

array1[n](47..32) = SaturateSignedDwordToSignedWord array2[n](31..0);

array1[n](63..48) = SaturateSignedDwordToSignedWord array2[n](63..32);

}

void _stdcall _packssdwi(_mmxdata *array,_mmxdata *imm,int n);

Description

Each element of array will be packed with imm. The result is written to array. The number of elements of array is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array[n](15..0) = SaturateSignedDwordToSignedWord array[n](31..0);

array[n](31..16) = SaturateSignedDwordToSignedWord array[n](63..32);

array[n](47..32) = SaturateSignedDwordToSignedWord imm(31..0);

array[n](63..48) = SaturateSignedDwordToSignedWord imm(63..32);

}
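The scalar (‘i’) form differs from the two-array form only in that the second operand is the same 64 bit value for every element. A portable C sketch of the doubleword variant follows; mmx_model_t, sat_sd_to_sw and model_packssdwi are hypothetical names, and the code is a reference model rather than what the compiler emits.

```c
#include <stdint.h>

/* Hypothetical stand-in for _mmxdata. */
typedef union {
    int16_t sw[4];  /* packed signed words       */
    int32_t sd[2];  /* packed signed doublewords */
} mmx_model_t;

/* Clamp a signed doubleword into the signed word range. */
static int16_t sat_sd_to_sw(int32_t d)
{
    if (d > INT16_MAX) return INT16_MAX;
    if (d < INT16_MIN) return INT16_MIN;
    return (int16_t)d;
}

/* Reference model of the pack with a scalar second operand. */
void model_packssdwi(mmx_model_t *array, const mmx_model_t *imm, int n)
{
    /* imm does not change inside the loop, so saturate it once. */
    int16_t hi0 = sat_sd_to_sw(imm->sd[0]);
    int16_t hi1 = sat_sd_to_sw(imm->sd[1]);

    while (n-- > 0) {
        /* Read both source doublewords before overwriting them. */
        int16_t lo0 = sat_sd_to_sw(array[n].sd[0]);
        int16_t lo1 = sat_sd_to_sw(array[n].sd[1]);
        array[n].sw[0] = lo0;
        array[n].sw[1] = lo1;
        array[n].sw[2] = hi0;
        array[n].sw[3] = hi1;
    }
}
```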

Pack with unsigned saturation

void _stdcall _packuswb(_mmxdata *array1,_mmxdata *array2,int n);

Description

Each element of array1 will be packed with the corresponding element of array2. The result is written to array1. The number of elements of both arrays is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array1[n](7..0) = SaturateSignedWordToUnsignedByte array1[n](15..0);

array1[n](15..8) = SaturateSignedWordToUnsignedByte array1[n](31..16);

array1[n](23..16) = SaturateSignedWordToUnsignedByte array1[n](47..32);

array1[n](31..24) = SaturateSignedWordToUnsignedByte array1[n](63..48);

array1[n](39..32) = SaturateSignedWordToUnsignedByte array2[n](15..0);

array1[n](47..40) = SaturateSignedWordToUnsignedByte array2[n](31..16);

array1[n](55..48) = SaturateSignedWordToUnsignedByte array2[n](47..32);

array1[n](63..56) = SaturateSignedWordToUnsignedByte array2[n](63..48);

}

void _stdcall _packuswbi(_mmxdata *array,_mmxdata *imm,int n);

Description

Each element of array will be packed with imm. The result is written to array. The number of elements of array is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array[n](7..0) = SaturateSignedWordToUnsignedByte array[n](15..0);

array[n](15..8) = SaturateSignedWordToUnsignedByte array[n](31..16);

array[n](23..16) = SaturateSignedWordToUnsignedByte array[n](47..32);

array[n](31..24) = SaturateSignedWordToUnsignedByte array[n](63..48);

array[n](39..32) = SaturateSignedWordToUnsignedByte imm(15..0);

array[n](47..40) = SaturateSignedWordToUnsignedByte imm(31..16);

array[n](55..48) = SaturateSignedWordToUnsignedByte imm(47..32);

array[n](63..56) = SaturateSignedWordToUnsignedByte imm(63..48);

}

Packed Add

Packed add byte

void _stdcall _paddb(_mmxdata *array1,_mmxdata *array2,int n);

Description

Each element of array1 will be added to the corresponding element of array2. The result is written to array1. The number of elements of both arrays is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array1[n](7..0) = array1[n](7..0) + array2[n](7..0);

array1[n](15..8) = array1[n](15..8) + array2[n](15..8);

array1[n](23..16) = array1[n](23..16) + array2[n](23..16);

array1[n](31..24) = array1[n](31..24) + array2[n](31..24);

array1[n](39..32) = array1[n](39..32) + array2[n](39..32);

array1[n](47..40) = array1[n](47..40) + array2[n](47..40);

array1[n](55..48) = array1[n](55..48) + array2[n](55..48);

array1[n](63..56) = array1[n](63..56) + array2[n](63..56);

}

void _stdcall _paddbi(_mmxdata *array,_mmxdata *imm,int n);

Description

imm will be added to each element of array. The result is written to array. The number of elements of array is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array[n](7..0) = array[n](7..0) + imm(7..0);

array[n](15..8) = array[n](15..8) + imm(15..8);

array[n](23..16) = array[n](23..16) + imm(23..16);

array[n](31..24) = array[n](31..24) + imm(31..24);

array[n](39..32) = array[n](39..32) + imm(39..32);

array[n](47..40) = array[n](47..40) + imm(47..40);

array[n](55..48) = array[n](55..48) + imm(55..48);

array[n](63..56) = array[n](63..56) + imm(63..56);

}
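The plain packed add wraps around on overflow: each byte lane is added modulo 256 and no carry propagates into the neighbouring lane. The following portable C sketch models both the two-array and the scalar form; mmx_model_t, model_paddb and model_paddbi are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for _mmxdata, viewed as eight byte lanes. */
typedef union {
    uint8_t ub[8];
} mmx_model_t;

/* Two-array form: array1[k] += array2[k], lane by lane, wrapping. */
void model_paddb(mmx_model_t *array1, const mmx_model_t *array2, int n)
{
    while (n-- > 0)
        for (int i = 0; i < 8; i++)
            array1[n].ub[i] = (uint8_t)(array1[n].ub[i] + array2[n].ub[i]);
}

/* Scalar form: the same 64 bit value imm is added to every element. */
void model_paddbi(mmx_model_t *array, const mmx_model_t *imm, int n)
{
    while (n-- > 0)
        for (int i = 0; i < 8; i++)
            array[n].ub[i] = (uint8_t)(array[n].ub[i] + imm->ub[i]);
}
```

Because the lanes are independent, signed and unsigned bytes give the same bit pattern here, which is why the wraparound add needs no signed/unsigned variants.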

Packed add word

void _stdcall _paddw(_mmxdata *array1,_mmxdata *array2,int n);

Description

Each element of array1 will be added to the corresponding element of array2. The result is written to array1. The number of elements of both arrays is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array1[n](15..0)<--array1[n](15..0) + array2[n](15..0);

array1[n](31..16)<--array1[n](31..16) + array2[n](31..16);

array1[n](47..32)<--array1[n](47..32) + array2[n](47..32);

array1[n](63..48)<--array1[n](63..48) + array2[n](63..48);

}

void _stdcall _paddwi(_mmxdata *array,_mmxdata *imm,int n);

Description

imm will be added to each element of array. The result is written to array. The number of elements of array is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array[n](15..0)<--array[n](15..0) + imm(15..0);

array[n](31..16)<--array[n](31..16) + imm(31..16);

array[n](47..32)<--array[n](47..32) + imm(47..32);

array[n](63..48)<--array[n](63..48) + imm(63..48);

}

Packed add double word

void _stdcall _paddd(_mmxdata *array1,_mmxdata *array2,int n);

Description

Each element of array1 will be added to the corresponding element of array2. The result is written to array1. The number of elements of both arrays is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array1[n](31..0)<--array1[n](31..0) + array2[n](31..0);

array1[n](63..32)<--array1[n](63..32) + array2[n](63..32);

}

void _stdcall _padddi(_mmxdata *array,_mmxdata *imm,int n);

Description

imm will be added to each element of array. The result is written to array. The number of elements of array is given by ‘n’.

Mode of operation:

while (n-- > 0) {

array[n](31..0)<--array[n](31..0) + imm(31..0);

array[n](63..32)<--array[n](63..32) + imm(63..32);

}

Packed Add with saturation

Packed add byte with saturation

a) Signed variants

void _stdcall _paddsb(_mmxdata *array1,_mmxdata *array2,int n);

void _stdcall _paddsbi(_mmxdata *array1,_mmxdata *imm,int n);

b) Unsigned variant

void _stdcall _paddusb(_mmxdata *array1,_mmxdata *array2,int n);

void _stdcall _paddusbi(_mmxdata *array1,_mmxdata *imm,int n);

Description

Each element of array1 will be added to the corresponding element of array2, or to the scalar operand for the ‘i’ variants. The result is written to array1. The number of elements is given by ‘n’.

For the signed operation, the result of the add is saturated to 0x7F in case of overflow and to 0x80 in case of underflow.

For the unsigned operation, the saturation values are 0xFF and 0x00 in case of overflow/underflow.
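The difference from the wraparound add is that each lane is computed in a wider type and then clamped. A portable C sketch of the two-array byte variants follows; mmx_model_t, model_paddsb and model_paddusb are hypothetical names, and the code is a reference model only.

```c
#include <stdint.h>

/* Hypothetical stand-in for _mmxdata: signed and unsigned byte views. */
typedef union {
    int8_t  sb[8];
    uint8_t ub[8];
} mmx_model_t;

/* Signed saturating add: results are clamped to 0x7F / 0x80. */
void model_paddsb(mmx_model_t *array1, const mmx_model_t *array2, int n)
{
    while (n-- > 0)
        for (int i = 0; i < 8; i++) {
            int sum = array1[n].sb[i] + array2[n].sb[i];
            if (sum > INT8_MAX) sum = INT8_MAX;  /* 0x7F on overflow  */
            if (sum < INT8_MIN) sum = INT8_MIN;  /* 0x80 on underflow */
            array1[n].sb[i] = (int8_t)sum;
        }
}

/* Unsigned saturating add: results are clamped to 0xFF. */
void model_paddusb(mmx_model_t *array1, const mmx_model_t *array2, int n)
{
    while (n-- > 0)
        for (int i = 0; i < 8; i++) {
            unsigned sum = (unsigned)array1[n].ub[i] + array2[n].ub[i];
            array1[n].ub[i] = (uint8_t)(sum > UINT8_MAX ? UINT8_MAX : sum);
        }
}
```

For example, 100 + 100 yields 127 in the signed model and 200 + 100 yields 255 in the unsigned one, where the wraparound add would give -56 and 44 respectively.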

Packed add word with saturation

void _stdcall _paddsw(_mmxdata *array1,_mmxdata *array2,int n);

void _stdcall _paddswi(_mmxdata *array1,_mmxdata *imm,int n);

Description

Same operation as in _paddsb above. The saturation values are 0x7FFF and 0x8000 for the signed operation, and 0xFFFF and 0x0000 for the unsigned operation.

Packed And

void _stdcall _pand(_mmxdata *array1,_mmxdata *array2,int n);

void _stdcall _pandi(_mmxdata *array1,_mmxdata *imm,int n);

The bitwise logical AND operation is done between each 64 bit element of array1 and the corresponding element of array2 (or imm). The result is written to array1.

Packed And Not

void _stdcall _pandn(_mmxdata *array1,_mmxdata *array2,int n);

void _stdcall _pandni(_mmxdata *array1,_mmxdata *imm,int n);

First, a bitwise logical NOT is performed on the 64 bits of each element of the source operand (array2), inverting each bit. Then, the bitwise logical AND operation is done between each 64 bit element of the arrays. The result is written to array1.
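Both logical operations act on whole 64 bit elements, so a reference model needs no lane handling. The sketch below follows the wording of this proposal, in which array2 is the inverted operand; note that the hardware PANDN instruction inverts its destination operand, so the generated code would have to arrange its registers accordingly. mmx_model_t, model_pand and model_pandn are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical stand-in for _mmxdata, viewed as one quadword. */
typedef union {
    uint64_t q;
} mmx_model_t;

/* array1[k] = array1[k] AND array2[k] */
void model_pand(mmx_model_t *array1, const mmx_model_t *array2, int n)
{
    while (n-- > 0)
        array1[n].q &= array2[n].q;
}

/* array1[k] = array1[k] AND (NOT array2[k]), as described in the text. */
void model_pandn(mmx_model_t *array1, const mmx_model_t *array2, int n)
{
    while (n-- > 0)
        array1[n].q &= ~array2[n].q;
}
```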

Replicate

void _stdcall _replicatebyte(_mmxdata *dst,unsigned char c);

void _stdcall _replicateword(_mmxdata *dst,unsigned short w);

void _stdcall _replicatedword(_mmxdata *dst,unsigned int i);

These instructions replicate either a byte, a word or a double word into the MMX data pointed to by the ‘dst’ argument. Their use is essentially meant for comparisons.

The 64 bits pointed to by the first argument will be filled with the given integer, repeated as bytes, words, or double words.

Example:

_replicatebyte(&mmxdata,' ');

Then, mmxdata will contain 8 spaces, and can be later used as an argument for comparison functions.
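In portable C the three replicate operations amount to filling every lane of the element with the same value. The sketch below is a reference model under that reading; mmx_model_t and the model_replicate* functions are hypothetical names, not the intrinsics themselves.

```c
#include <stdint.h>

/* Hypothetical stand-in for _mmxdata. */
typedef union {
    uint8_t  ub[8];  /* eight bytes      */
    uint16_t uw[4];  /* four words       */
    uint32_t ud[2];  /* two double words */
} mmx_model_t;

/* Fill all eight byte lanes with c. */
void model_replicatebyte(mmx_model_t *dst, unsigned char c)
{
    for (int i = 0; i < 8; i++)
        dst->ub[i] = c;
}

/* Fill all four word lanes with w. */
void model_replicateword(mmx_model_t *dst, unsigned short w)
{
    for (int i = 0; i < 4; i++)
        dst->uw[i] = w;
}

/* Fill both double word lanes with d. */
void model_replicatedword(mmx_model_t *dst, unsigned int d)
{
    for (int i = 0; i < 2; i++)
        dst->ud[i] = d;
}
```

The resulting element is a natural second operand for the scalar (‘i’) forms of the other intrinsics.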

Reduce

int _stdcall _reduceBooleanb(_mmxdata *map,int n);

int _stdcall _reduceCmpeqb(_mmxdata *map,_mmxdata *imm,int n);

int _stdcall _reduceGtb(_mmxdata *map,_mmxdata *imm,int n);

int _stdcall _reduceLtb(_mmxdata *map,_mmxdata *imm,int n);

These instructions sum a boolean vector, counting its nonzero members, and return the result as a 32 bit integer.

_reduceBooleanb sums all true bytes (11111111) in a logical vector that is the result of a previous comparison.

_reduceCmpeqb makes a comparison for equality and then adds up the hits.

_reduceLtb and _reduceGtb test for less than or greater than, respectively, and add up the ‘true’ bytes.

‘True’ bytes are those set to all ones (11111111b, 0xFF, or 255 decimal) by a previous MMX logical operation.

Example:

If the MMX data element ‘space’ contains a set of 8 space bytes (32), the following will count the number of spaces in the character vector ‘data’:

_reduceCmpeqb(data,&space,len/8);
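The space-counting example can be checked against a portable C model of the compare-and-count step. This is a sketch under the assumption that the comparison is done byte lane by byte lane against the same ‘imm’ element for every member of the vector; mmx_model_t and model_reduceCmpeqb are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical stand-in for _mmxdata, viewed as eight byte lanes. */
typedef union {
    uint8_t ub[8];
} mmx_model_t;

/* Count the byte lanes of map[0..n-1] that are equal to the
   corresponding byte lane of imm. */
int model_reduceCmpeqb(const mmx_model_t *map, const mmx_model_t *imm, int n)
{
    int hits = 0;
    while (n-- > 0)
        for (int i = 0; i < 8; i++)
            if (map[n].ub[i] == imm->ub[i])
                hits++;
    return hits;
}
```

Since the count is given in 8 byte elements, an expression such as len/8 silently ignores up to seven trailing bytes when len is not a multiple of 8; those would have to be handled separately.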

[1] I would rather use ‘p’ for parallel, but this is a matter of taste.